Cone beam computed tomography in the assessment of TMJ deformity in children with JIA: repeatability of a novel scoring system

Background The temporomandibular joint (TMJ) is frequently involved in juvenile idiopathic arthritis (JIA). Diagnostic imaging is necessary to correctly diagnose and evaluate TMJ involvement; however, little has been published to date on the accuracy of the applied scoring systems and measurements. The present study aims to investigate the precision of 20 imaging features and five measurements based on cone beam computed tomography (CBCT). Methods Imaging and clinical data from 84 participants in the Norwegian study on juvenile idiopathic arthritis, the NorJIA study, were collected. Altogether 20 imaging features and five measurements were evaluated independently by three experienced radiologists for intra- and interobserver agreement. Agreement of categorical variables was assessed by Fleiss’, Cohen’s simple or weighted Kappa as appropriate. Agreement of continuous variables was assessed with 95% limits of agreement as advised by Bland and Altman. Results “Overall impression of TMJ deformity” showed almost perfect intraobserver agreement with a kappa coefficient of 0.81 (95% CI 0.69–0.92), and substantial interobserver agreement (Fleiss’ kappa 0.70 (0.61–0.78)). Moreover, both “flattening” and “irregularities” of the eminence/fossa and condyle performed well, with intra- and interobserver agreements of 0.66–0.82 and 0.55–0.76, respectively. “Reduced condylar volume” and “continuity” of the fossa/eminence had moderate intra- and interobserver Kappa values, whereas continuity of the condyle had Kappa values above 0.55. Measurements of distances and angles had limits of agreement of more than 15% of the sample mean. Conclusions We propose a CBCT-based scoring system of nine precise imaging features suggestive of TMJ deformity in JIA. Their clinical validity must be tested. Supplementary Information The online version contains supplementary material available at 10.1186/s12903-022-02701-5.


Background
Juvenile idiopathic arthritis (JIA) is an autoimmune condition, which includes all arthritides of unknown origin with onset before 16 years of age and duration more than six weeks [1]. JIA is the most common rheumatic disease of childhood with a prevalence of up to 1-2 per 1000 in developed countries [1].
Both history and clinical findings of TMJ arthritis can be equivocal, emphasising the importance of imaging to detect and monitor active disease [10,11]. Magnetic resonance imaging (MRI) is the preferred modality for evaluation of disk pathology and active TMJ arthritis, with visualisation of joint effusion and a thickened and hyperaemic synovium [12,13]. The osseous structures, on the other hand, are better depicted with radiographic techniques. The tomographic nature of the panoramic technique is susceptible to image distortion, and can only be used as an overview to detect gross deformity [14]. Studies on dry skulls have shown that cone beam computed tomography (CBCT) has better sensitivity and similar reliability for detection of condylar pathology as compared to computed tomography (CT) [15,16]. The effective dose varies greatly depending on the CBCT machine, field of view and applied exposure parameters. However, compared to CT, CBCT was reported to have about 35% lower radiation dose and better subjective image quality [17]. Newer MRI techniques have shown promising results in adults for assessment of TMJ shape, yet, taken together, osseous deformity is currently best assessed with CBCT [18].
The drawbacks of conventional radiography and the increasing importance of detecting early signs of TMJ involvement in JIA have led to increased interest in CBCT for diagnosis, follow-up and further research. However, to date, little has been published on the precision and reproducibility of features and measurements used for assessing TMJ deformity in children. For example, previous studies used consensus instead of agreement [6,19], gave incomplete information [20,21], used inappropriate methods [22][23][24] or did not discuss agreement or precision at all [25][26][27][28][29]. This information is, however, key to a correct understanding and clinical use of the findings, as outlined in 1991 by Fryback and Thornbury in their widely cited paper [30].
The aims of the present study were to examine the precision and repeatability of a predefined set of 20 CBCT-based imaging features and five measurements used to describe TMJ deformity, and next to devise a scoring system based on the more robust features.

Methods
The present study is part of the Norwegian JIA study (NorJIA), a prospective, longitudinal observational study performed between 2015 and 2020. Participants in the main study (n = 228) were recruited from three tertiary university hospitals in the Western, Central and Northern Norway Regional Health Authorities. Children aged 4-16 years were included if they met the diagnostic criteria of JIA according to the International League of Associations for Rheumatology (ILAR) Classification [31]. Written informed consent was obtained from all participants and/or their legal guardian/parent. As part of the study protocol all participants in the NorJIA study were referred for a CBCT. The exclusion criterion for the present study was suboptimal examination due to artefacts.
To examine the precision and repeatability of potential imaging features, a subset of CBCT examinations was selected from the NorJIA study population by one of the local radiologists (TAA, MS, XS) at each site. The selection was based on the CBCT report and demographic and clinical information to represent an a priori balanced range of imaging findings where approximately one-third had moderate/severe findings, one-third mild findings and one-third subtle or no findings. Given this prevalence, an expected Kappa coefficient of 0.6 and a precision of ± 0.15 at a confidence level of 90%, the estimated sample size was 81 [32]. According to Bland, a sample size of 100 for repeat measurements of continuous variables is good, giving 95% CIs about the upper and lower limits of agreement of approximately ± 0.34 × the standard deviation of the differences [33]. The completed checklist for the guidelines for reporting reliability and agreement studies (GRRAS) is found in Additional file 1 [34].
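The ± 0.34 figure quoted above can be reproduced with a short calculation. The sketch below assumes the commonly used approximation that the standard error of a limit of agreement is about sqrt(3 × s² / n), where s is the standard deviation of the differences; the function name loa_ci_half_width is hypothetical and the calculation is illustrative only, not part of the study's analysis.

```python
import math


def loa_ci_half_width(n, sd_diff=1.0, z=1.96):
    """Approximate 95% CI half-width around one limit of agreement.

    Assumption: SE(limit) ~ sqrt(3 * s^2 / n), with s the SD of the
    paired differences and n the number of pairs. The result is in the
    same units as sd_diff (multiples of s when sd_diff is 1.0).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    standard_error = math.sqrt(3.0 * sd_diff ** 2 / n)
    return z * standard_error


# Half-widths, in multiples of the SD of the differences,
# for a few illustrative sample sizes.
for pairs in (50, 84, 100):
    print(pairs, round(loa_ci_half_width(pairs), 2))
```

Under this approximation, 100 pairs give a half-width of roughly 0.34 standard deviations, and smaller samples give correspondingly wider intervals.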

Imaging
The CBCT examinations took place at the Regional Competence Centres for Oral Health. They were conducted by experienced radiographers with the participants positioned with the Frankfort plane horizontal and their teeth in maximal intercuspal position. CBCT model and machine settings are outlined in Table 1.

Image review
Prior to scoring, calibration between readers was performed during a number of face-to-face and online meetings to address volume reorientation, identification of landmarks and multiplanar reconstructions. Further, imaging features and their grading were carefully discussed and standardised based on both single images and complete examinations, with a particular focus on discrimination between categories. Based on previous literature, five measurements (glenoid fossa depth and length, glenoid fossa/articular eminence inclination angle, and condyle length and width) and 20 imaging features describing anatomy and deformity (overall impression of TMJ deformity, condyle volume and position, joint surface continuity, irregularity and flattening of the condyle and glenoid fossa/articular eminence, apposition, ankylosis, heterotopic bone formation, loose joint body and findings suggestive of osteoarthritis) were identified for the present study [6,26,35-42]. Definitions of volume reorientation, measurements and variables are given in Figs. 1 and 2 and Table 2, respectively. The examinations were read independently by two radiologists and one dento-maxillofacial radiologist (TAA, OA and CX, with 13, 12 and 14 years of experience in imaging, respectively). The examinations were anonymised for all information except scan date and study site. Image viewing conditions were standardised (diagnostic screens, ambient lighting etc.) and the Digital Imaging and Communications in Medicine (DICOM) image viewers included in the three respective CBCT systems were used (details in Table 1). Adjustment of window-level settings was allowed. After an interval of at least three weeks the reading was repeated by TAA.
To examine the potential of CBCT to specify the location of pathology, the glenoid fossa and the condyle were scored for irregularities and flattening by segmental scoring and as a whole, in separate sessions. However, preliminary results showed that segmental scoring had very poor agreement [43], and this was therefore omitted from further analysis.

Statistics
Ordinal data are presented as medians (ranges), dichotomous data as proportions and continuous data as medians (IQR) or means (±SD). For categorical variables, intra- and interobserver agreement was assessed with kappa (κ) coefficients (95% confidence interval). Fleiss' kappa was applied for three observers and Cohen's simple or linear weighted kappa for two observers or observations. A κ coefficient of < 0 was considered poor, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial and 0.81-1.00 almost perfect [44]. For continuous variables, intra- and interobserver agreement was assessed by calculating the mean difference and standard deviation (SD) of the differences to establish the 95% limits of agreement (LOA, mean difference ± 1.96 × SD) as advised by Bland and Altman [45]. Outliers were removed from final analysis if the value was more than four standard deviations from the mean [46]. Bland-Altman plots are usually informally interpreted, and we set the limit for clinically acceptable agreement, with the 95% LOA expressed as a percentage of the sample mean (100 × 95% LOA / sample mean), at 15%. The mean differences were used as a measure of bias, and considered statistically significant given a p-value < 0.05 (two tailed, one-sample t-test). Proportional bias was assumed if linear regression was statistically significant. To assess the potential impact of different CBCT-systems, intraobserver analyses were repeated stratified by study site for the categorical variables, and a one-way between-groups analysis of variance was conducted for the intraobserver mean differences. All statistical analysis was performed using IBM SPSS version 28 (IBM, Chicago, IL). The level of statistical significance was set at 5% (p-value < 0.05).
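As an illustration of the categorical agreement statistics described above, the following minimal Python sketch computes Cohen's simple and linear weighted kappa for two readings and maps a coefficient onto the descriptive bands used here. It is not the SPSS procedure used in the study: the function names (cohen_kappa, agreement_label) and the example scores are hypothetical, and confidence intervals are not computed.

```python
def cohen_kappa(first, second, n_categories, linear_weights=False):
    """Cohen's kappa for two readings scored 0..n_categories-1.

    With linear_weights=True, disagreements are penalised in proportion
    to their distance on the ordinal scale (linear weighted kappa).
    """
    if len(first) != len(second) or not first:
        raise ValueError("readings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("at least two categories are required")
    k, n = n_categories, len(first)

    # Cross-tabulate the two readings as proportions.
    table = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        table[a][b] += 1.0 / n
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if linear_weights:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(disagreement(i, j) * table[i][j] for i, j in cells)
    expected = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    if expected == 0:
        raise ValueError("kappa is undefined when only one category is used")
    return 1.0 - observed / expected


def agreement_label(kappa):
    """Descriptive band for a kappa coefficient, following the cut-offs above."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical first and second readings of ten joints on a 0-2 scale.
reading_1 = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]
reading_2 = [0, 0, 1, 2, 2, 2, 0, 1, 1, 0]
simple = cohen_kappa(reading_1, reading_2, 3)
weighted = cohen_kappa(reading_1, reading_2, 3, linear_weights=True)
print(round(simple, 2), agreement_label(simple))
print(round(weighted, 2), agreement_label(weighted))
```

The weighted variant gives partial credit when two readings differ by only one category, which is why it suits ordinal scores such as the 0-2 and 0-3 scales used here.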

Results
Eighty-four children (51 girls) with JIA were included (84 CBCT examinations). Median age at CBCT examination was 14.3 years (IQR 4.3), median age at diagnosis was 6.2 years (IQR 8.7) and median disease duration at the time of the CBCT was 6.3 years (IQR 6.3) (Table 3).

Imaging features
The distribution of scores for each of the CBCT-features is given in Fig. 3 (right TMJ, first reading). Six features were not further analysed due to severely skewed distribution. Absolute, intra- and interobserver agreement for the remaining 14 features are detailed in Table 4. Examples of continuity of the articular surface, irregularities and flattening are shown in Figs. 4, 5, 6, and 7. Assessment of the overall impression of TMJ deformity on a 0-2 scale showed almost perfect agreement for the same reader, with a kappa coefficient of 0.81 (95% CI 0.69-0.92) (Table 4). The interobserver agreement was substantial, with a Fleiss' kappa coefficient of 0.70 (0.61-0.78). The absolute agreement between three observers was 64 out of 82 (78%) (Table 4).
There was substantial to moderate agreement for the assessment of flattening, surface irregularity and continuity of the articular eminence and glenoid fossa on a 0-2 scale, with kappa coefficients of 0.77, 0.66 and 0.46, respectively, for the same reader. The inter-reader agreement was moderate to substantial with kappa coefficients between 0.43 and 0.65 (Table 4).
Assessment of condylar flattening, both on a sagittal view, 0-3 scale, and on a coronal view, 0-2 scale, showed almost perfect and substantial agreement for the same reader, with kappa coefficients of 0.82 and 0.71, respectively. The inter-reader agreement was substantial and moderate, with kappa coefficients of 0.76 and 0.60 (Table 4). There was substantial to moderate intra- and inter-reader agreement for the assessment of condylar surface irregularity, reduced condylar volume and continuity of the condyle, as well as for assessment of secondary degenerative change such as subchondral sclerosis and the presence of osteophytes (Table 4). For assessment of condyle position within the glenoid fossa the interobserver agreement was slight, κ = 0.10, and intra-observer agreement fair, κ = 0.38. Analysis stratified for study site did not change the results.
[Table fragment, JIA category: ... (29); psoriatic arthritis 2 (2); enthesitis-related arthritis 9 (11); undifferentiated arthritis 8 (10)]
Fig. 3 Distribution of findings, right side, 1st reading. The x-axis denotes number of participants
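The interobserver figures reported in this subsection are Fleiss' kappa values for three readers, given alongside the absolute agreement. A minimal Python sketch of both quantities follows; it assumes every joint is scored by the same number of readers, uses hypothetical function names (fleiss_kappa, absolute_agreement) and made-up example scores, and does not reproduce the study's SPSS output or its confidence intervals.

```python
def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa; ratings[s] holds one category index per reader for subject s."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of readers")

    # counts[s][c]: number of readers assigning subject s to category c.
    counts = [[0] * n_categories for _ in range(n_subjects)]
    for s, subject in enumerate(ratings):
        for category in subject:
            counts[s][category] += 1

    # Mean observed pairwise agreement across subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the pooled category proportions.
    total = n_subjects * n_raters
    proportions = [sum(row[c] for row in counts) / total
                   for c in range(n_categories)]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


def absolute_agreement(ratings):
    """Proportion of subjects on which all readers gave the same score."""
    return sum(1 for r in ratings if len(set(r)) == 1) / len(ratings)


# Hypothetical scores from three readers for five joints on a 0-2 scale.
scores = [[0, 0, 0], [1, 1, 1], [2, 2, 2], [0, 0, 1], [1, 2, 2]]
print(round(fleiss_kappa(scores, 3), 2), absolute_agreement(scores))
```

Absolute agreement is reported next to kappa because kappa depends on the distribution of scores: the same proportion of identical ratings can yield different kappa values when findings are rare.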

Measurements
Measurements and differences resembled normal distribution patterns. The mean glenoid fossa lengths and depths, mean glenoid fossa/articular eminence inclination angles and the mean condylar diameters are given in Table 5. In the final analysis 24 outliers were removed. Bland-Altman plots of differences in all measurements showed relatively wide 95% limits of agreement, varying from 13.6 to 83.8% of the sample means (Table 5, Fig. 8). For linear measurements, the mean differences (bias) ranged from 0.0 to 0.2 mm for the same observer and from 0.0 to 0.4 mm between observers. For the angular measurements the corresponding figures were 0.3-1.8 and 0.2-2.6 degrees. The mean differences were significantly different from zero (p < 0.05) in a number of measurements, as noted in Table 5. Furthermore, there was proportional bias in 17 out of 72 measurement pairs, of which 12 had a positive slope. Examples of Bland-Altman plots without and with bias are given in Additional file 2. In the stratified analysis the mean differences were not significantly different between the study sites in 19 out of 20 measurements.
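The limits of agreement summarised above can be sketched in a few lines of Python. The example computes the bias, the 95% limits of agreement and the limits as a percentage of the sample mean for paired measurements. One point is an assumption rather than something stated in the text: the percentage is taken here as the full width of the limits (upper minus lower) divided by the mean of all measurements. The function name bland_altman, the 15% threshold argument and the example values are likewise illustrative only.

```python
import statistics


def bland_altman(first, second, acceptable_pct=15.0):
    """Bias and 95% limits of agreement for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [a - b for a, b in zip(first, second)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd

    # Assumption: LOA percentage = full LOA width / mean of all measurements.
    sample_mean = statistics.mean(list(first) + list(second))
    loa_pct = 100.0 * (upper - lower) / sample_mean
    return {
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "loa_pct": loa_pct,
        "acceptable": loa_pct <= acceptable_pct,
    }


# Hypothetical repeated measurements (mm) of the same five joints.
reading_1 = [10.0, 12.0, 11.0, 13.0, 9.0]
reading_2 = [10.5, 11.5, 11.0, 12.0, 9.5]
result = bland_altman(reading_1, reading_2)
print({key: round(value, 2) if isinstance(value, float) else value
       for key, value in result.items()})
```

Expressing the limits relative to the sample mean lets lengths in millimetres and angles in degrees be judged against a single threshold, which a raw difference would not allow.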

Discussion
In this study we have identified a set of nine robust CBCT-based image markers suggestive of TMJ deformity in children and adolescents with JIA. These include an overall impression of TMJ deformity, subjective assessment of condylar volume, joint surface continuity, surface irregularity and flattening of the condyle and articular eminence and glenoid fossa. Their clinical validity remains to be determined. Importantly, we also found that measurements of distances and angles performed poorly with wide limits of agreement.
We have shown that the overall impression of TMJ deformity can be reliably scored on CBCT, both for the same and between observers when using the three categories normal, mild or moderate/severe deformity. Our results compare well with those of Stoustrup and colleagues, who, in a study of 47 JIA patients and 19 non-JIA patients examined with a large field of view CBCT to assess associations between condylar changes and facial asymmetry, re-assessed 20 randomly selected patients to examine intra-observer agreement [42]. They found a substantial agreement for both discrimination of normal from pathological condyles and categorization of the pathological condyles as predominantly deformed or eroded, with kappa values of 0.67 and 0.63, respectively. The same group later added a fourth category (combined deformation plus erosion) to their scoring system and reported an almost perfect intra-observer agreement (kappa = 0.83) for reassessment of 30 of 245 CBCTs in a long-term follow-up examination of the Nordic JIA cohort [4]. Our approach differed in that we assessed surface irregularities, flattening and signs of osteoarthritis separately, for both the glenoid fossa/articular eminence and for the condyle. We used up to four categories, thus allowing for a more detailed evaluation. All assessments performed well, both for the same and between observers.
Moreover, our suggested scoring system allows for evaluation of the continuity of the joint surfaces. Importantly, this continuity, or loss of continuity, can be accurately distinguished from irregularities, which we defined as changes of shape more sharply demarcated than flattening, that may be continuous or not. According to the RDC/TMD criteria [35] a surface erosion is defined as loss of continuity. However, this definition does not correspond well with the provided image examples, where irregularities/bony depressions are shown rather than loss of surface continuity. A distinction similar to ours was most likely included by Arvidsson et al. [6] as "cortical defect with/without sclerotic border", but to the best of our knowledge its agreement has not been previously examined. We speculate that this marker might represent early involvement, not visualised on MRI. This is of interest as it may allow more precise and early monitoring of treatment response. The somewhat less favourable agreement for assessing the temporal versus the mandibular component of the joint might be due to the lower proportion of pathological findings in the glenoid fossa/articular eminence. Following a consensus process and reliability exercise for three previously suggested MRI-scoring systems, Tolend and colleagues suggested 3-point scales for erosions and for condylar flattening in the oblique sagittal plane [39]. The scales tested in their reliability exercises for these two items were different from the final suggestion. However, in both systems both items met their predefined threshold for acceptable reliability, i.e. average measure intraclass correlation > 0.80 and smallest detectable difference < 30%, which also suggests robustness, although these measures are not directly comparable to kappa coefficients.
We have previously suggested that condylar flattening as assessed from the coronal plane is a robust imaging feature [36]. In the present study on JIA patients, flattening of the condyle could be accurately scored from both the coronal and sagittal views. We acknowledge that a flattened articular eminence/glenoid fossa as a consequence of JIA was suggested decades ago, and has more recently been added to an MRI scoring system; however, this feature's agreement has not been examined previously [6,37,41,47].
Condylar volume can be quantified with semiautomated techniques [48]. However, these techniques are more time consuming than subjective assessment. To the best of our knowledge, our study is the first to examine the agreement of subjective grading of reduced condylar volume. Scoring on a 0-4 scale performed well for the same reader, whilst the agreement between three assessors was fair, but still appropriate for clinical use. Although perhaps not comparable with grading of pathology, classification of appearance is relevant to features such as reduced condylar volume: Karlo and coworkers reported substantial interobserver agreement, κ = 0.67, for classification of the condyle into one of three types in 210 children examined with CT for reasons not related to rheumatic or TMJ-disease [38].
The RDC/TMD original paper did not report agreement for each of their suggested research diagnostic criteria, but stated substantial agreement (kappa coefficient 0.71, absolute agreement 86%) between three observers for a dichotomous diagnosis of osteoarthritis in 145 joints in adults examined with CT [35]. We did not assess for osteoarthritis as such, but found similar agreement for its acknowledged separate elements subchondral sclerosis and osteophytes. Subchondral cysts and osteophytes occurred rarely, thus weakening our results, and we therefore suggest that osteoarthritis is scored dichotomously in accordance with the RDC/TMD criteria [35].
Ikeda and Kawamura reported an association between disc displacement and the position of the condyle within the glenoid fossa based on measurements of the joint space [49]. We found that subjective assessment of condylar position within the glenoid fossa on a 7-position scale performed poorly and is not sufficiently accurate for clinical use. Moreover, contrary to Ikeda and Kawamura, we found measurements of angles and small structures too imprecise for clinical use, with wide variation within and between observers and LOA outside our predefined limit of 15% of the sample means.
Our results are in line with those of Kellenberger et al. [37], in an MR study of 18 adolescents with anterior disc displacement and 18 patients with JIA. They found, based on a subset of 12 TMJs in 6 patients, a mean difference of 0.05 mm between two observers measuring the glenoid fossa depth, with wide limits of agreement (LOA −1.35 to 1.46) corresponding to 55% of the sample mean. Corresponding figures for the glenoid fossa/articular eminence inclination angle were 1.5 degrees and 55% of the sample mean, respectively. Similarly, Karlo and colleagues retrospectively measured the anteroposterior and mesiolateral diameter of the condyles on CT examinations in 210 children [38]. They reported wide interobserver LOA for both mesiolateral (−2.8 to 2.0 mm) and anteroposterior diameter (−2.0 to 1.6 mm), or 49% and 34% of the respective sample means. We believe that a substantial part of the observed variation reflects the inherent variability in patient positioning, scan orientation during acquisition or volume reorientation prior to image review, in addition to the difficulties in defining the exact measurement points, as also noted by others [50].
Given our results, we suggest a novel, CBCT-based scoring system for future studies based on the most robust features identified (Table 6). To the best of our knowledge, this is the first comprehensive scoring system that has been established for CBCT in the assessment of TMJ pathology in JIA.

Limitations and strengths
There are limitations to our study. Firstly, there is the subjective nature of developing any scoring system, with inherent biases in readers' understanding of what to score. To overcome this, several calibration meetings were held and an atlas depicting examples with different scores was established. Secondly, we were unable to include CBCT of healthy volunteers for ethical reasons. Of note, approximately half the joints were scored with no or minor pathology. Thirdly, there are limitations to the use of Cohen's kappa, particularly in datasets with skewed distributions [51]. We therefore present the proportion agreement and the distribution of scores to increase the transparency of the results. Fourthly, we chose to use only two or three observers to assess the scoring system's potential given optimal conditions, rather than in a clinical setting. Fifthly, we acknowledge that despite efforts to obtain a balanced material, we were unable to properly assess the precision of rarely occurring imaging features of JIA, like subchondral pseudocyst, osteophytes, ankylosis, apposition, heterotopic calcification and loose joint body, due to their very low prevalence and even absence in our material. Finally, the CBCT machines were from different vendors between the sites, and hence the scan parameters had to be adapted locally. The differences in spatial resolution could potentially have influenced the ability to discriminate/detect discrete pathological findings and landmarks. However, this issue was thoroughly addressed during the calibration sessions, and stratified analysis showed no relevant differences between study sites.
The strengths of this study include a large sample of children and adolescents with JIA, a meticulous standardization including construction of an atlas, the examination of both precision and repeatability and the use of image volumes instead of single images or stacks of images, thus reflecting the entire process of reading images. We included a breadth of pathology and severity of imaging features which were assessed by a scoring system that tested a wide variety of CBCT features of JIA.

Conclusions
We have identified a set of robust CBCT-features suggestive of TMJ deformity in children with JIA, upon which a novel scoring system is proposed.